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Abstract 

We show that the herding procedure 
of Welling (2009b) takes exactly the form of 
a standard convex optimization algorithm — 
namely a conditional gradient algorithm min- 
imizing a quadratic moment discrepancy. 
This link enables us to invoke convergence 
results from convex optimization and to con- 
sider faster alternatives for the task of ap- 
proximating integrals in a reproducing kernel 
Hilbcrt space. We study the behavior of the 
different variants through numerical simula- 
tions. Our experiments shed more light on 
the learning bias of herding: they indicate 
that while we can improve over herding on 
the task of approximating integrals, the orig- 
inal herding algorithm approaches more often 
the maximum entropy distribution. 



1. Introduction 

The herding algorithm has recently been presented 
by Welling (2009b) as a computationally attractive al- 
ternative method for learning in intractable Markov 
random fields models (MRF). Instead of first estimat- 
ing the parameters of the MRF by maximum like- 
lihood / maximum entropy (which requires approxi- 
mate inference to estimate the gradient of the partition 
function), and then sampling from the learned MRF 
to answer queries, herding directly generates pseudo- 
samples in a deterministic fashion with the property of 
asymptotically matching the empirical moments of the 
data (akin to maximum entropy). The herding algo- 
rithm generates pseudo-samples Xt with the following 
simple recursion: 



x t+ i e argmax (w t ,$(x)) 

x£LX 
W t +l = W t + (J,- $(x t +l), 



(1) 
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where X is the observation space; $ is a feature map 
from X to J- , which could be viewed as the vector of 
sufficient statistics for some exponential family, and fi 
is a mean vector to match (the empirical moment vec- 
tor of the same family). Unlike in frequentist learning 
of MRFs, the parameter wt never converges to a point 
in herding and actually follows a "weakly chaotic" 
walk (Welling & Chen, 2010). 

The herding updates can be motivated from two dif- 
ferent perspectives. From the learning perspective, the 
herding updates can be derived by performing fixed- 
step-size subgradient ascent on the zero-temperature 
limit of the annealed likelihood function of the MRF — 
called the "tipi function" by Welling (2009b). From 
this perspective, herding was later generalized to 
MRFs with latent variables (Welling, 2009a) as well 
as discriminative MRFs (Gelfand et ah, 2010). 

From the moment matching perspective, which has 
been explored more in details by Chen ct ah (2010), 
the herding updates can be derived as an effective 
way to choose greedily pseudo-samples x t in order 
to quickly decrease the moment discrepancy St = 
||/i - \ Y? i=1 $(xi)\\ (Chen et ah, 2010). Under suit- 
able regularity conditions, £ t decreases as 0(l/t) for 
the herding updates — this is faster than i.i.d. sampling 
from the distribution generating /i (e.g., the train- 
ing data) which would yield the slower 0{l/\fi) rate. 
This faster rate has been explained by negative auto- 
correlations amongst the pseudo-samples and was used 
by Chen et ah (2010) to sub-select a small collection of 
representative "super-samples" from a much larger set 
of i.i.d. samples. We make the following contributions: 

- We show that herding as described by Eq. (1) is 
equivalent to a specific type of conditional gradi- 
ent algorithm (a.k.a. Frank- Wolfe algorithm) for the 
problem of estimating the mean (i. This provides a 
novel understanding and another explicit cost func- 
tion that herding is minimizing. 

- This interpretation yields improvements, for the 
task of estimating means, with other faster variants 
of the conditional gradient algorithm, which lead to 
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non-uniform weights, one based on line-search, one 
based on an active-set algorithm. 

- Based on existing results from convex optimization, 
we extend and improve the convergence results of 
herding. In particular, we provide a linear con- 
vergence rate for the line-search variant in finite- 
dimensional settings and show how the conditions 
assumed by Chen et al. (2010) in fact never hold in 
the infinite-dimensional setting. 

- We run experiments that show that algorithms 
which estimate faster the mean than herding gener- 
ate samples that are not better (and typically worse) 
than the ones obtained with herding when evaluated 
in terms of the ability to approximate a sample with 
large entropy, a property which (if or when satisfied 
by herding) could be the basis for an interpretation 
of herding as a learning algorithm (Welling, 2009b). 

2. Mean estimation 

We start with a similar setup as Chen et al. (2010), 
where herding can be interpreted as a way to approx- 
imate integrals of functions in a reproducing kernel 
Hilbert space (RKHS). We consider a set X and a map- 
ping <J> from X to a RKHS T . Through this mapping, 
all elements of T may be identified with real func- 
tions / on X defined by f(x) = (/, $(#)), for x G X. 
We denote by k : (x,y) >->■ k(x,y) the associated pos- 
itive definite kernel. Note that the mapping $ may 
be explicit (classically in low-dimensional settings) or 
implicit — where the kernel trick can be used, see Sec- 
tion 4.3 and Chen et al. (2010). 

Throughout the paper, we assume that the data is 
uniformly bounded in feature space, that is, for all 
x G X, ||$(x)|| < R, for some R > 0; this condition is 
needed for the updates of Eq. (1) to be well-defined. 

We denote by M. C T the marginal polytope 
(Wainwright & Jordan, 2008; Chen et al., 2010), i.e., 
the convex- hull of all vectors $(x) for x € X. Note 
that for any / € J 7 , we have 

sup/(x) = sup (f,g), 

and that \f(x)\ = |(/,$(x))| ^ \\f\\R for all x G X 
and / G J- (i.e., all functions with finite norm are 
bounded). 

Extreme points of the marginal polytope. In 

all the cases we consider in Section 5, it turns out that 
all points of the form $(x), x G X arc extreme points 
of M. (see an illustration in Figure 1). This is indeed 
always true when ||4>(x)|| is constant for all x G X (for 
example for our infinite-dimensional kernels on [0, 1] 
in Section 5.1); it is also true if $(x) contains both an 
injective feature map $(.t) and its self-tensor-product 





Figure 1. Marginal polytope in two situations: (left) fi- 
nite number of extreme points (typical with discrete data), 
(right) polynomial kernel in one dimension, with <b(x) = 
(x,x 2 ) for xe [-1,1]. 

<&(x) £g> <&(x), which is the case in the graphical model 
examples in Section 5.2. 

Mean element and expectation. We consider 
a fixed probability distribution p(x) over X. Follow- 
ing Chen et al. (2010), we denote by fi the mean ele- 
ment (see, e.g., Smola et al., 2007): 

/ti = E p ( x )$(z) eM. 

Note that in the learning perspective, p is the empirical 
distribution on the data and so \i is the corresponding 
empirical moment vector to match. To approximate 
this mean, we consider n points Xi, . . . , x„ G X com- 
bined linearly with positive weights w± , . . . , w n that 
sum to one. These define p, the associated weighted 
empirical distribution, and fi the approximating mean: 

ft = E p{x) $(x) = £r=i Wi$(zi) G M. (2) 

For all functions / G J 7 , we then have 

E p (x)/(x) =Ep (a .) (/,*(»)) = (/i,/), 

and similarly Kp/ X \f(x) = (/&,/)• We thus get, using 
Cauchy-Schwarz inequality, 

su P/e^, ||/||=i l E p(x)/(x)-Ep (:r) /(a-)| = ||/x-/i||, 

and controlling fj, — fi is enough to control the error 
in computing the expectation for all / G J- with finite 
norm. Note that a random i.i.d. sample from p(x) 
would lead to an expected worst-case error which is 
less than ^5= — a classical result based on Rademacher 

\Jri 

averages (see, e.g. Boucheron et al., 2005). 

In this paper, we will try to find a good estimate jl of 
fi based on a weighted set of points from {$(x),x G 
X}, generalizing Chen et al. (2010), and show how this 
relates to herding. 

3. Related work 

This paper brings together three lines of work, namely 
the approximation of integrals, herding and convex op- 
timization. The links between the first two were clearly 
outlined by Chen et al. (2010), while the present pa- 
per provides the novel interpretation of herding as a 
well-established convex optimization algorithm. 





Figure 2. Two versions of two iterations of conditional gra- 
dient after moving to an initial corner <?i; left: p t — m;, 
right: line search. The minimum-norm-point algorithm 
would have converged to a after two iterations. 

Finally, in order to minimize the number of iterations, 
a variant known as the minimum-norm-point algo- 
rithm will find gt+i that minimizes J on the convex 
hull of all previously visited points, using a specific 
active-set algorithm (sec Bach, 2011, Sec. 6, for de- 
tails). For convex sets with finitely many extreme 
points, it converges in a finite number of iterations 
with higher (though still polynomial) iteration com- 
putational cost (Wolfe, 1976). 

4. Herding as a Prank- Wolfe algorithm 

To relate herding with conditional gradient algorithms, 
we consider the following optimization problem: 

1„ 



mm J(g) = -H.9 
g&M I 



/'-I 
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3.1. Quadrature/cubature formulas 

The evaluation of expectation, or cquivalently of in- 
tegrals, is a classical problem in numerical analysis. 
When the input space A" is a compact subset of W 
and p(x) is the density of the distribution with re- 
spect to the Lebesgue measure, then this is equivalent 
to evaluating the integral JV, f(x)p(x)dx. Quadrature 
formulas are aimed at computing such integrals as a 
weighted combinations of values of / at certain points, 
which is exactly the problem we consider in Section 2. 

Although a thorough review of quadrature formulas is 
outside of the scope of this paper, we mention two 
methods which are related to our approach. First, 
given a positive definite kernel and a given set of points 
(typically sampled i.i.d. from a given distribution), the 
Bayes-Hermite quadrature of O'Hagan (1991) essen- 
tially computes an orthogonal projection of p onto the 
affine hull of this set of points. This does not lead to 
positive quadrature weights, and one may thus replace 
the affine hull by the convex hull to obtain such non- 
negative weights, which we do in our experiments in 
Section 5. 

Moreover, quasi-Monte Carlo methods consider se- 
quences of so-called "quasi-random" quadrature points 
so that the empirical average approaches the inte- 
gral. These quasi-random sequences are such that 
the approximation error goes down as 0{l/n) (up 
to logarithmic terms) for functions of bounded vari- 
ation, as opposed to 0(1/ y/n) for a random sequence. 
In simulations, we used a Sobol sequence (see, e.g., 
Morokoff & Caflisch, 1994). 

3.2. Franke- Wolfe algorithms 

Given a smooth (twice continuously diffcrentiable) 
convex function J on a compact convex set A4 with 
gradient J', Frank- Wolfe algorithms are a class of it- 
erative optimization algorithms that only require (in 
addition to the computation of the gradient J') to be 
able to optimize linear functions on M . The first class 
of such algorithms is often referred to as conditional 
gradient algorithms: given an iterate gt, the minimum 
of (J'(g t ),g) over g G M. is computed, and the next 
iterate is taken on the segment between g t and g, i.e., 
g t+ i = p t gt + {l-pt)g, where p t € [0, 1]. There are two 
natural choices for p t , (a) simply taking p t = l/(t + 1) 
and (b) performing a line search to find the point in 
the segment with optimal value (either for J or for a 
quadratic upper-bound of J). These are illustrated in 
Figure 2, and convergence rates are detailed in Sec- 
tion 4.2. Moreover, for quadratic functions, the condi- 
tional gradient algorithm with step sizes p t = l/(i + 1) 
has a dual interpretation as subgradient ascent (see, 
e.g., Bach, 2011), which we outline in Section 4.1. 



(3) 



with the trivial unique solution p. A conditional gradi- 
ent algorithm to solve this optimization problem with 
stepsize p t = l/(i + 1) use the iterates 



g t+ i S argmm i (g t - p,g), 

g£M 

gt+i = (l -pt)gt + ptgt+i- 



(4) 



But these updates are exactly the same as herding via 
the change of variable g t = p — w t /t. Indeed, the 
minimizer of a linear function of a convex set g~t+i 
can be restricted to be an extreme point of Ai; this 
implies that gt+i = 3>(x*+i) for a certain Xt+i- The 
herding updates in Eq. (1) are thus equivalent to the 
conditional gradient minimization of J with step size 
given by p t = l/{t+ 1). 

With this choice of step size, we get (t + l)gt+i = 
tg t + $(x t +i), that is g t = \ YL u =i ®( x u), and we thus 
get uniform weights in Eq. (2). 

For general step-sizes p t <E [0,1], if we assume that 
we start the algorithm with po = 1 (which implies 
,9i = $(xi)), then we get that g t = Y! u =i (ut^LC 1 - 
p v -i)p u -i)$(x u ), which thus leads to non-uniform 
weights in Eq. (2), though they still sum to one. The 
conditional gradient updates in Eq. (4) can thus be 
seen as a generic algorithm to obtain a weighted set 
of points to approximate p, and traditional herding is 
the pt = l/(t + 1) step-size case. 

A second choice of step-size for t > 1 is to use a line 
search, which leads in this setting (where the global 
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unconstrained minimum happens to belong to M.) to 
= <gt-M,gt-gt+i) e r -n Thi j d t var jant of 

rt lls*+i-9tir L ' J 

herding with non- uniform weights. 

We finally comment on the initialization go for the 
updates in Eq. (4). In the kernel herding algorithm 
of Chen et al. (2010), the authors use wo = p as this 
is required to interpret the herding updates as greedily 
minimizing £ t (with the additional assumptions that 
||$(x)|| is constant). In our setting, this corresponds to 
choosing go = (which might be outside of M. , though 
this is not problematic in practice). Another stan- 
dard choice (for MRFs in particular) is to use wo = 
(.9o = m)i which means that the first point x\ is chosen 
randomly from the extreme points of M. — this is the 
scheme we used. As is common in convex optimiza- 
tion, we didn't see any qualitative difference in our 
experiments between the two types of initialization. 

4.1. Dual problem and subgradient descent 

Welling (2009b) proposed originally an algorithmic in- 
terpretation of herding as performing subgradient as- 
cent with constant step size on a function obtained 
as the zero temperature limit of the log-likelihood of 
an exponential model that he called the "tipi func- 
tion" . Our interpretation of the procedure as a Frank- 
Wolfe algorithm might therefore appear as a conflict- 
ing interpretation at first sight. To establish a natu- 
ral connection between these two interpretations, we 
can compute the Fcnchcl-dual optimization problem 
to Eq. (3). Indeed, we have (with standard arguments 
for swapping the min and max operations): 



mi", dl.9 - A*l 

g£M 2 



minmax/ T (#-/z) - -||/|| 2 

g£M /£JF 2 

max min f T (g - p) - -\\f\\ 2 
feJ 7 g£M 2 

max{mm/(x) - (f,p) - -\\f\\ 2 }. 



The dual function / i-» miii^g* f(x) - (/, p) - \\\f\\ 2 is 
1-strongly concave and non-diffcrentiable, and a nat- 
ural algorithm to maximize it is thus subgradient as- 
cent with a step size equal to ttt, which is known 
to be equivalent to the primal conditional gradient al- 
gorithm with step sizes pt = 1/(4 + 1) (Bach, 2011, 
App. A). It is therefore not surprising that herding 
is equivalent to subgradient ascent with a decreas- 
ing stepsize on this function (with the identification 
ft = gt — p = —Wt/t). The presence of the squared 
norm which is added to the "tipi function" merely re- 
flects the change of scaling between g t and w t . It is 
worthwhile mentioning that other functions, like Breg- 
man divergences, would have led to a different dual 
function hence adding a different term than a squared 
norm to the "tipi function" . 



4.2. Convergence analysis 

Without further assumptions on the problem, then the 
two choices of step sizes lead to a convergence rate of 
the form (Dunn, 1980; Bach, 2011): 



1 



11.9/. - Mir < 4 



B 2 



2 „- r M - t 

where R is diameter of the marginal polytope. Note 
that the convergence in 0(1/ 1) does not lead to im- 
proved estimation of integrals over random sampling. 
Moreover, without further assumptions, current theo- 
retical analysis does not allow to distinguish between 
the two forms of conditional gradient algorithms (al- 
though they differ a bit in practice, see Section 5). 

However, if we assume that within the affine hull of 
Ai, there exists a ball of center p and radius d > 
that is included in M. (i.e., p is in the relative inte- 
rior of Al), then both choices of step sizes yield faster 
rates than random sampling. For the version with 
line search, we actually obtain a linear convergence 
rate (Beck & Teboullc, 2004): 

±\\g t -p\\ 2 ^R 2 cxp (~^l). 

For the version without line search (pt = l/(t + 1)), 
Chen et al. (2010) shows the slower convergence rate 
in 0(l/t 2 ): 



1 



1.9* - Mil 
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2 IU " r " " d 2 t 2 ' 
Concerning the assumption that p, is in the relative 
interior of M. , we now show that in finite-dimensional 
settings, this assumption is always satisfied under rea- 
sonable conditions, while it is never satisfied in a large 
class of infinite-dimensional settings (namely for Mer- 
cer kernels). 

We first provide an equivalent definition of this condi- 
tion which is used in the proofs. Let A be the affine 
hull of Al, po the orthogonal projection of on A, 
and T the space of directions (or difference space) of 
A (i.e., T = A — po)- 1 Now there exists d > so that 
any element of A which is at distance less than d of p 
is in M. if and only if V/ £ J-", the maximum of f T g 
over g e A and \\g — p\\ ^ d is less than the maximum 
of f T g over g e M. Given the properties of A and f, 
this is equivalent to: 

V/ G J-, max / g ^ max / g 

\\g—/j.\\^.d g£M 

«*• V/G.F, (M,/)+d||/Kmax/(x). (5) 

x£X 

Proposition 1 Assume that T is finite-dimensional, 
that X is a compact topological measurable space with 



x It turns out that po is a function taking a constant 
value since the orthogonality condition yields (po,Q(x) — 
$(y)) = 0, i.e., po{x) = po(y) for all x,y £ X, by the 
reproducing property of T. 
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a continuous kernel function, and that the distribution 
p has full support on X . Then 3d > so that Eq. (5) 
is true. 

Proof sketch. It is sufficient to show that 

ft : f M> max l£ ^ f(x) — (/j, f) is a norm on T: 
as all norms are equivalent in finite dimensions, we get 
rf||/|| < ^(/) for some d > 0, yielding Eq. (5). VI is con- 
vex and positively homogeneous by construction. Now 
0(/) = implies that E p{x) [f(x) - max. y /(y)] = 0, 
and thus /(x) = max. y f{y) for x in the support of p 
(assumed to be X) using the fact that / is continuous 
(since the kernel is continuous), and so / is a constant 
function. We then have two possibilities: either 
/Jo = 0, in which case one can show that there is no 
non-zero constant functions in T\ otherwise / = a/xo 
for some a and thus the orthogonality condition 
(/iMo) = implies that a — 0. Both cases imply 
/ = 0, hence O is a norm. ■ 

Proposition 2 Assume X is a compact subspace of 
M. d , and that the kernel k is a continuous function on 
X x X. If J 7 is infinite-dimensional, then there exists 
no d > so that Eq. (5) is true. 

Proof sketch. We can apply Mercer theorem to the 
kernel k(x, y) of the projection onto the orthogonal of 
{fi, fio}- This kernel is also a Mercer kernel, and we get 
an orthonormal basis (efc)fe^i of L 2 (X) with countably 
many eigenvalues A& that are summable. Moreover, 

1/2 

(A fe efe)fc^i is known to be an orthonormal basis of the 
associated feature space J- (Cucker & Smalc, 2002), 
and for all x,y € X, k{x,y) = J^k^i ^k€-k{x)e k {y), 
with uniform convergence. This implies that for /& = 

1/2 

X k ' e k , we have ||/ fe || = 1, and (fk,Vo) = (/fc>W = °- 

If we assume that there exists d > so that Eq. (5) is 
true, then we have for all k Js 1, max l6 ^ |/fc(x)| ^ d. 
Since X is compact, there exists a cover of T with 
finitely many balls of radius d/AR. Let y be the finite 
set of centers. Since all functions f k arc Lipschitz- 
continuous with constant 2R, then for all k ^ 1, 
m&x xey \fk(x)\ ^ d - 2R x d/AR = d/2. Since y is 
finite, there exists x € y so that |/fc(x)| Js d/2 > 
for infinitely many values of k; this contradicts the 
summability of ^fc>i fk{ x ) 2 - Hence the result. ■ 

The last proposition shows that the current theory 
only supports the slower rates of 0{l/t) for the two 
conditional gradient algorithms in infinite-dimensional 
settings. On the other hand, we note that, in some 
cases, traditional herding performs empirically better 
without known theoretical justification (sec Section 5). 

4.3. Computational issues 

In order to run a herding algorithm, there are two 
potential computational bottlenecks: 

Computing expectations (/x, $(#)): in a learning 



context (empirical moment matching), these are done 
through an empirical average. In an integral evalua- 
tion context, in finite-dimensional settings, one needs 
to compute E p ( a; )$(x); while in an infinite-dimensional 
setting, following Chen et al. (2010), expectations of 
the form E p ( a; )fc(a;,j/) need to be computed. This can 
be done for some pairs of kernels/distributions, like 
the ones we choose in Section 5, but not in general. 

Minimizing (g t — /J, $(x)) with respect to x E X: 
in general, this computation may be relatively hard (it 
is for example NP-hard in the context of the graph- 
ical models we consider in Section 5). In practice, 
Chen et al. (2010) and Welling (2009a) perform local 
search, while another possibility is to perform the min- 
imization through exhaustive search in a finite sam- 
ple. Note that a convex relaxation through variational 
methods (Wainwright & Jordan, 2008) could provide 
an interesting alternative. 

5. Experiments 

The goals of these simulations are (a) to compare the 
different algorithms aimed at estimating integrals, i.e., 
assess herding for the task of mean estimation (Sec- 
tion 5.1 and Section 5.2), and (b) to briefly study the 
empirical relationship with maximum entropy estima- 
tion in a learning context (Section 5.3). 

5.1. Kernel herding on X = [0, 1] 

Problem set-up. In this section, we consider 

X = [0, 1] and the norm ||/|| 2 = J^[f^{x)} 2 dx on the 
infinite-dimensional space of functions with zero mean 
and whose z;-th derivative exists and is in L 2 ([0,1]). 
As shown by Wahba (1990), the associated kernel is 
equal to B2 " [x ^~ ) \ x ~ vl \ where B 2v is the (2i/)-th 
Bernoulli polynomial, with B 2 {x) 
B 6 (x) = : 



and 



3x 5 



5^.4 
2 x 



2 X 



_j_ 

12' 



We consider either the uniform density on [0,1], or 
a randomly selected density of the form p{x) ex 

( 5Zj =1 a i cos(2i7rx) + bi sin(2i7rx)) , for which all re- 
quired expectations may be computed in closed form. 
In particular, the mean element is computed as fi : 
x <— > E[fc(Y,x)] which may be computed in closed 
form by expanding all terms in the Fourier basis. As 
for the optimization step, it consists in minimizing 
<7t(x) — /x(x) over the interval [0, 1] which can be done 
efficiently with exhaustive search. 

Comparison of mean estimation procedures. 

We compare in Figure 3 two kernels, i.e., with v = 
1 (left and middle plots) and v = 3 (right plot), 
the following mean estimation procedures, and plot 
log 10 ||/1 — /i||, for two densities, the uniform den- 
sity (middle and right) and a randomly selected non- 
uniform density (left). We compare the following: 
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- cg-l/(t+l): conditional gradient procedure with 
Pt = TTT, which is the original herding procedure 
of Welling (2009a). leading to uniform weights. 

- cg-1. search: conditional gradient procedure with 
line search (with non- uniform weights). 

- min-norm-point: Minimum- norm point algorithm. 
This leads to non-uniform weights. 

- random: Random selection of points (from p(x)), 
averaged over 10 replications. 

- sobol: a classical quasi-random sequence, with uni- 
form weights. For non-uniform densities, we first 
apply the inverse cumulative distribution function. 

For all of these (except for min-norm-point), we also 
consider an extra a posteriori projection step (denoted 
by the -proj suffix), which computes the optimal set 
of non- uniform weights by finding the best approxima- 
tion of p in the convex hull of the points selected by 
the algorithm. We can draw the following conclusions: 

- Min-norm-point algorithms always perform best. 

- Conditional gradient with line search is performing 
slightly worse than regular herding. (Note that we 
are in the infinite-dimensional setting and so they 
both have 0(i _1 ) as theoretical rate.) 

- The extra projection step always significantly im- 
proves performance, and sometimes enough that 
random selection of point combined with a reprojec- 
tion outperforms regular herding (at least for v = 3, 
i.e., with a smaller feature space). 

- On the right plot, it turns out that the Sobol se- 
quence is known to achieve the optimal rate of 
0(t~ 2 ) for \\p — jl\\ 2 for the associated Sobolcv 
space (Wahba, 1990). In this situation, regular 
herding empirically achieves the same rate; how- 
ever, the theoretical analysis provided in the present 
paper or by Chen et al. (2010) does not allow to ex- 
plain or support this statement theoretically. 

Estimation from a finite sample. In Figure 4, 
we compare three of the previously mentioned herd- 
ing procedures when all quantities are computed from 
a random sample of size 1000. In plain, testing er- 
rors are computed (using exact expectations) while in 
dashed, the training errors are computed. All meth- 
ods eventually fit the empirical mean, with no further 
progress on the testing error, this behavior happening 
faster with the min-norm-point algorithm. 

5.2. Estimation on graphical models 

We consider X = { — 1, l} d and a random variable com- 
puted as the sign (in { — 1,1}) of a Gaussian random 
vector in ~R d , together with <&(:e) composed of x and of 
all of its pairwise products xx T . In this set-up, we can 
compute the expectation E p ( a; )<I > (x) in closed form, as 
the mass assigned to the positive orthant by a bivari- 
ate Gaussian distribution with correlation p, which is 
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herding procedures for kernels on [0, 1] with v = 3 and 
non-uniform density. 



-cg-1/(t+1) 

line-search 
-min-norm-point" 
random 




2000 3000 

number of samples 



5000 



Figure 5. Comparison of herding procedures on graphical 
models with 100 binary variables. See Section 5.2. 

equal to i + J- sin -1 p (Abramowitz & Stegun, 1964). 
We are here in the finite- dimensional setting and the 
faster rates derived in Section 4.2 apply. 

We generated 10000 samples from such a distribution 
and performed herding with exact expectations but 
with minima with respect to x computed over this sam- 
ple (by exhaustive search over the sample). We plot 
results in Figure 5, where we see the superiority of the 
min-norm-point procedure over the two other proce- 
dures (which include regular herding). Note that the 
line-search algorithm is slower than the l/(t + l)-rule, 
which seems to contradict the bounds. The bounds 
depend on the distance d between the mean and the 
boundary of the marginal polytopc. If this is too small 
(much like if the strong convexity constant is too small 
for gradient descent), the linear convergence rate can 
only be seen for larger numbers of iterations. 

5.3. Herding and maximum/minimum entropy 

Given a moment vector p obtained from the empirical 
mean of $(a;) on data, the goal of herding is to produce 
a pseudo-sample whose moments match p without 
having to estimate the canonical parameters of the cor- 
responding model. A natural candidate for such a dis- 
tribution is the maximum entropy distribution and we 
will compare it to the results of herding in cases where 
it can be easily computed, namely for X = { — 1, l} d 
(with d ^ 10) and either $(:r) = x e [— 1, l] d or 
<&(x) = (x,xx T ). In this setup, following Welling 
(2009a), the distribution on a; € A" is estimated by 
the empirical distribution y ■_ 1 Wi6(x = .t,). 
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Learning independent bits. We first consider 
(f>(x) = x and some specific feasible moment fj, G M.. 
It is well-known that the maximum entropy distri- 
bution is the one with independent bits. In the top 
panels of Figure 6, we compare the norm between the 
maximum entropy probability vector and the one esti- 
mated by two versions of herding, namely conditional 
gradient with stepsize p t = 1/(4 + 1) (regular herd- 
ing with uniform weights) and with line search (with 
non-uniform weights) — the min-norm-point algorithm 
leads to quantitatively similar results. We show re- 
sults in Figure 6 for a mean vector [i which is a ran- 
dom uniform vector in [— 1,1] d (left plots), and for a 
mean /i which is random with uniform (/ij + l)/2 val- 
ues in {1, 2, 3, 4, 5} x ^0- (middle plots), and for mean 
values [i which is are random with uniform (jn + l)/2 
values in {1, 2, 3,4, 5}/6 (right plots). Note that the 
difference between rational and irrational means was 
already brought up by Welling & Chen (2010) through 
the link between herding and Sturmian sequences. 

For each of the mean vector, we plot in the top plots, 
the error in estimating the full maximum entropy dis- 
tribution (a vector of size 2 d ), and in the bottom plots, 
the error in estimating the feature means (a vector of 
size d). We can draw the following conclusions: 

- For a random vector /j, (left plots), then regular 
herding (with no line search) empirically converges 
to the maximum entropy distribution. 

- For rational ratios between the means (but irra- 
tional means, middle plots), then there is no con- 
vergence to the maximum entropy distribution. 

- For rational means (right), there is no convergence 
either, but the behavior is more erratic. 

- The line-search procedure never converges to the 
maximum entropy procedure. On the opposite, it 
happens to be close to a minimum entropy solution, 
where many states have probability zero. 

Experiments considered in Figure 6 considered a single 
draw of the mean vector ii, but similar empirical con- 
clusions may be drawn from other random samples, 
and we conjecture that for almost surely all random 



vectors /j, £ [— 1, l] d (which would avoid rational ratios 
between mean values) , then regular herding converges 
to the maximum entropy distribution. The next ex- 
periment shows that this is not the case in general. 

Learning non-independent bits. We now con- 
sider $(a;) = (x, xx T ), and a certain random feasi- 
ble moment /i <G M. As before, we compare the 
norm between the maximum entropy probability vec- 
tor and the one estimated by the two versions of herd- 
ing. We present results in Figure 7 for a mean vec- 
tor obtained by the corresponding exponential family 
distribution with zero weights for the features x and 
constant weights on the features xx T . We see that the 
herding procedures, while leading to a consistent esti- 
mation of the mean vector, does not converge to the 
maximum entropy distribution and other unreported 
experiments have led to similar results. 

6. Conclusion 

We showed that herding generates a sequence of points 
which give in sequence the descent directions of a con- 
ditional gradient algorithm minimizing the quadratic 
error on the moment vector. Therefore, if herding 
is only assessed in terms of its ability to approxi- 
mate the moment vector, it is outperformed by other 
more efficient algorithms. Clearly, herding was orig- 
inally defined with another goal, which was to gen- 
crate a pseudo-sample whose distribution could ap- 
proach the maximum entropy distribution with a given 
moment vector. Our experiments suggest empirically, 
that while this is the case in certain cases, herding 
fails in other case, which are not chosen to be par- 
ticularly pathological. This probably prompts for a 
further study of herding. 

Our experiments also show that algorithms that are 
more efficient than herding at approximating the mo- 
ment vector fail more blatantly to approach a maxi- 
mum entropy distribution and they present character- 
istics which would rather suggest a minimization of the 
entropy. This suggests the question of whether there is 
a tradeoff between approximating most efficiently the 
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Figure 6. Comparison of herding procedures on the independent bit problem with d — 10 binary variables. Top: estimation 
of the maximum entropy distribution, bottom: estimation of the mean of the features Q(x). From left to right: Mean 
values are selected uniformly at random on [— 1, 1], mean values are equal to y/2 times random rational numbers in [— 1, 1], 
mean values are equal to random rational numbers in [—1,1]. 
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models with 10 binary variables. Top: estimation of the 
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mean vector and approximating well the maximum en- 
tropy distribution, or if the two goals are in fact rather 
aligned? In any case, we hope that formulating herd- 
ing as an optimization problem can help form a better 
understanding of its goals and its properties. 
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